『Data Quality Fundamentals』

https://gyazo.com/f7cded0076b363fed5d8f8714cb47542

2022/10/11

Barr Moses (author)

Lior Gavish (author)

Molly Vorwerck (author)

O’Reilly & Associates Inc

O’Reilly subscription ref

Read it here

Preface

Conventions Used in This Book

Using Code Examples

O’Reilly Online Learning

How to Contact Us

Acknowledgments

1. Why Data Quality Deserves Attention—Now

What Is Data Quality?

Framing the Current Moment

Understanding the “Rise of Data Downtime”

Other Industry Trends Contributing to the Current Moment

Summary

2. Assembling the Building Blocks of a Reliable Data System

Understanding the Difference Between Operational and Analytical Data

What Makes Them Different?

Data Warehouses Versus Data Lakes

Data Warehouses: Table Types at the Schema Level

Data Lakes: Manipulations at the File Level

What About the Data Lakehouse?

Syncing Data Between Warehouses and Lakes

Collecting Data Quality Metrics

What Are Data Quality Metrics?

How to Pull Data Quality Metrics

Using Query Logs to Understand Data Quality in the Warehouse

Using Query Logs to Understand Data Quality in the Lake

Designing a Data Catalog

Building a Data Catalog

Summary

3. Collecting, Cleaning, Transforming, and Testing Data

Collecting Data

Application Log Data

API Responses

Sensor Data

Cleaning Data

Batch Versus Stream Processing

Data Quality for Stream Processing

Normalizing Data

Handling Heterogeneous Data Sources

Schema Checking and Type Coercion

Syntactic Versus Semantic Ambiguity in Data

Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka

Running Analytical Data Transformations

Ensuring Data Quality During ETL

Ensuring Data Quality During Transformation

Alerting and Testing

dbt Unit Testing

Great Expectations Unit Testing

Deequ Unit Testing

Managing Data Quality with Apache Airflow

Scheduler SLAs

Installing Circuit Breakers with Apache Airflow

SQL Check Operators

Summary

4. Monitoring and Anomaly Detection for Your Data Pipelines

Knowing Your Known Unknowns and Unknown Unknowns

Building an Anomaly Detection Algorithm

Monitoring for Freshness

Understanding Distribution

Building Monitors for Schema and Lineage

Anomaly Detection for Schema Changes and Lineage

Visualizing Lineage

Investigating a Data Anomaly

Scaling Anomaly Detection with Python and Machine Learning

Improving Data Monitoring Alerting with Machine Learning

Accounting for False Positives and False Negatives

Improving Precision and Recall

Detecting Freshness Incidents with Data Monitoring

F-Scores

Does Model Accuracy Matter?

Beyond the Surface: Other Useful Anomaly Detection Approaches

Designing Data Quality Monitors for Warehouses Versus Lakes

Summary

5. Architecting for Data Reliability

Measuring and Maintaining High Data Reliability at Ingestion

Measuring and Maintaining Data Quality in the Pipeline

Understanding Data Quality Downstream

Building Your Data Platform

Data Ingestion

Data Storage and Processing

Data Transformation and Modeling

Business Intelligence and Analytics

Data Discovery and Governance

Developing Trust in Your Data

Data Observability

Measuring the ROI on Data Quality

How to Set SLAs, SLOs, and SLIs for Your Data

Case Study: Blinkist

Summary

6. Fixing Data Quality Issues at Scale

Fixing Quality Issues in Software Development

Data Incident Management

Incident Detection

Response

Root Cause Analysis

Resolution

Blameless Postmortem

Incident Response and Mitigation

Establishing a Routine of Incident Management

Why Data Incident Commanders Matter

Case Study: Data Incident Management at PagerDuty

The DataOps Landscape at PagerDuty

Data Challenges at PagerDuty

Using DevOps Best Practices to Scale Data Incident Management

Summary

7. Building End-to-End Lineage

Building End-to-End Field-Level Lineage for Modern Data Systems

Basic Lineage Requirements

Data Lineage Design

Parsing the Data

Building the User Interface

Case Study: Architecting for Data Reliability at Fox

Exercise “Controlled Freedom” When Dealing with Stakeholders

Invest in a Decentralized Data Team

Avoid Shiny New Toys in Favor of Problem-Solving Tech

To Make Analytics Self-Serve, Invest in Data Trust

Summary

8. Democratizing Data Quality

Treating Your “Data” Like a Product

Perspectives on Treating Data Like a Product

Convoy Case Study: Data as a Service or Output

Uber Case Study: The Rise of the Data Product Manager

Applying the Data-as-a-Product Approach

Building Trust in Your Data Platform

Align Your Product’s Goals with the Goals of the Business

Gain Feedback and Buy-in from the Right Stakeholders

Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains

Sign Off on Baseline Metrics for Your Data and How You Measure Them

Know When to Build Versus Buy

Assigning Ownership for Data Quality

Chief Data Officer

Business Intelligence Analyst

Analytics Engineer

Data Scientist

Data Governance Lead

Data Engineer

Data Product Manager

Who Is Responsible for Data Reliability?

Creating Accountability for Data Quality

Balancing Data Accessibility with Trust

Certifying Your Data

Seven Steps to Implementing a Data Certification Program

Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team

In the Beginning: When a Small Team Struggles to Meet Data Demands

Supporting Hypergrowth as a Decentralized Data Operation

Regrouping, Recentralizing, and Refocusing on Data Trust

Considerations When Scaling Your Data Team

Increasing Data Literacy

Prioritizing Data Governance and Compliance

Prioritizing a Data Catalog

Beyond Catalogs: Enforcing Data Governance

Building a Data Quality Strategy

Make Leadership Accountable for Data Quality

Set Data Quality KPIs

Spearhead a Data Governance Program

Automate Your Lineage and Data Governance Tooling

Create a Communications Plan

Summary

9. Data Quality in the Real World: Conversations and Case Studies

Building a Data Mesh for Greater Data Quality

Domain-Oriented Data Owners and Pipelines

Self-Serve Functionality

Interoperability and Standardization of Communications

Why Implement a Data Mesh?

To Mesh or Not to Mesh? That Is the Question

Calculating Your Data Mesh Score

A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh

Can You Build a Data Mesh from a Single Solution?

Is Data Mesh Another Word for Data Virtualization?

Does Each Data Product Team Manage Their Own Separate Data Stores?

Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?

Is the Data Mesh Right for All Data Teams?

Does One Person on Your Team “Own” the Data Mesh?

Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?

Case Study: Kolibri Games’ Data Stack Journey

First Data Needs

Pursuing Performance Marketing

2018: Professionalize and Centralize

Getting Data-Oriented

Getting Data-Driven

Building a Data Mesh

Five Key Takeaways from a Five-Year Data Evolution

Making Metadata Work for the Business

Unlocking the Value of Metadata with Data Discovery

Data Warehouse and Lake Considerations

Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh

Moving from Traditional Data Catalogs to Modern Data Discovery

Deciding When to Get Started with Data Quality at Your Company

You’ve Recently Migrated to the Cloud

Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity

Your Data Team Is Growing

Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues

Your Team Has More Data Consumers Than They Did One Year Ago

Your Company Is Moving to a Self-Service Analytics Model

Data Is a Key Part of the Customer Value Proposition

Data Quality Starts with Trust

Summary

10. Pioneering the Future of Reliable Data Systems

Be Proactive, Not Reactive

Predictions for the Future of Data Quality and Reliability

Data Warehouses and Lakes Will Merge

Emergence of New Roles on the Data Team

Rise of Automation

More Distributed Environments and the Rise of Data Domains

So Where Do We Go from Here?

Index

About the Authors